October 21, 2016, Hopkins Marine Station, Stanford University

Outline

  • My origins in programming, data science, and open science

  • Improving reproducibility, collaboration and communication in environmental science with open science tools
    • Lowndes et al., in prep
    • reproducibility is fundamental to science, but rarely tested
    • these tools have fundamentally changed how we do science
  • Resources and recommendations
    • exposure and confidence: learning the tools available and using them

Data science and open science

Data Science:

"an exciting discipline that allows you to turn raw data into understanding, insight, and knowledge" (Grolemund & Wickham 2016)

Open Science:

"the concept of transparency at all stages of the research process, coupled with free and open access to data, code, and papers" (Hampton et al. 2014)

Open science tools

Open science workflow

My programming origins story

Photo credit: Greg Auger

Some thesis questions

  • what are Humboldt squid habitat preferences?
  • what season are they most abundant?
  • how fast and far can they migrate?
  • how do they interact with other species?
  • how do I import my data when it's too big for Excel?
  • how do I subset years or other attributes?
  • how do I visualize this?
  • how on earth do I even think about this?

Conflated questions

Science:

  • what are their habitat preferences?
  • what season are they most abundant?
  • how fast and far can they migrate?
  • how do they interact with other species?

Data science:

  • how do I import my data when it's too big for Excel?
  • how do I subset years or other attributes?
  • how do I visualize this?
  • how on earth do I even think about this?
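
The data science questions above (importing data that is too big for a spreadsheet, then subsetting by year) can be sketched in a few lines of code. This is a minimal illustration in Python rather than the R workflow described later in the talk. The column names `year` and `depth_m` and the sample rows are hypothetical, and a real file would be opened with `open(path, newline="")` instead of the in-memory string used here.

```python
import csv
import io

# Hypothetical sample standing in for a large CSV file on disk.
SAMPLE_CSV = """year,depth_m
2007,310.5
2008,295.0
2008,402.25
2009,350.0
"""


def subset_by_year(lines, year):
    """Stream rows one at a time and keep only those from the given year.

    Rows are read one by one, so only the matching subset is held in
    memory rather than the whole file.
    """
    reader = csv.DictReader(lines)
    return [row for row in reader if int(row["year"]) == year]


rows_2008 = subset_by_year(io.StringIO(SAMPLE_CSV), 2008)
depths_2008 = [float(row["depth_m"]) for row in rows_2008]
print(depths_2008)
```

Once the subset exists as ordinary values, the visualization question becomes a separate, smaller problem.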

I learned to program like many do

  • in a panic
  • for a single purpose (get this thesis done!)
  • in near-isolation*


* except for wonderful programming mentors:

Steve Haddock, Dave Foley, Ashley Booth

NCEAS, UC Santa Barbara

TODO: image

Ocean Health Index

method to categorize benefits that oceans provide to people

scores are modeled using existing data; data intensive

method can be tailored to different geographies

can help inform policy decisions, especially when repeated

OHI Global Assessments

2013: second annual global assessment

  • repeat methods
  • update data
  • compare between years

We expected to easily reproduce our previous work. We had planned ahead:

  • coded models
  • 130 pages of published supplemental material
  • internal documents and notes

We thought we were doing reproducible science

We struggled to reproduce our work using standard approaches to reproducibility and collaboration

…mainly due to our approaches to data preparation…i.e. data science

  • added challenge of managing multiple years of information and scores
  • we needed a nimble approach to sharing data, methods, and results within and outside our team

We identified three main challenges to overcome

  1. reproducibility, including transparency and repeatability, esp. data prep
  2. collaboration, including teamwork and internal collaboration
  3. communication with scientific and public communities

Lowndes et al. Improving reproducibility, collaboration, and communication in environmental science using open science tools, in prep

  • exposure and confidence
  • evolution rather than revolution

Addressing challenges using open science tools

Reproducibility - data preparation

"Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in the mundane labor of collecting and preparing data, before it can be explored for useful information." - NYTimes (2014)

  • transforming, rescaling, gap-filling, formatting, etc.
  • seldom mentioned but underpins the scientific process
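
As a rough illustration of what steps such as gap-filling and rescaling can look like once they are coded rather than done by hand, here is a minimal Python sketch. It is not the OHI team's procedure, which the slides describe as coded in R. The function names, the mean-based gap-filling, the min-max rescaling, and the sample values are all assumptions made for illustration.

```python
from statistics import mean


def gap_fill(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot gap-fill: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def rescale(values):
    """Min-max rescale values to the range 0 to 1."""
    lo, hi = min(values), max(values)
    if lo == hi:
        # All values identical: avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical raw measurements with one missing value.
raw = [2.0, None, 4.0, 6.0]
filled = gap_fill(raw)
scaled = rescale(filled)
print(filled)
print(scaled)
```

Because each step is a named function, the same preparation can be rerun unchanged when the next year's data arrive.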

Before

  • manually (without coding)
  • largely Microsoft Excel
  • internal documents and emails

After

  • full process coded
    • R with documentation
    • RMarkdown

Reproducibility - version control

TODO: version control quote

Before

  • filenames suffixed with dates, initials
    • e.g. final.csv and final_JL-2016-08-05.csv
  • email descriptions of what changed between files

After

  • version control with git
  • short messages accompany committed changes

Collaboration - communication + file sharing

TODO: Collaboration quote

Before

  • email chains (often forwarded)

After

  • GitHub issues

Demo link (private)

Communication - sharing data, code, methods

Work in progress

  • incremental
  • always improving, learning
  • teaching and training, support

OHI Today

These tools and this workflow make our work possible.

  • December 8, 2016: releasing 5th global assessment
  • Support and training for ~26 government or academic 'OHI+' assessments

All on ohi-science.org

My recommendations

Learn to program a different way

  • not in a panic, but feeling empowered
  • not for a single purpose, but thinking ahead
  • not in isolation, but with a community

My recommendations

You've had a bit of exposure to these tools. Next: build confidence.

1 - Learn to code

    - in R
    - with RStudio

2 - Use version control

    - git
    - with GitHub
    - through RStudio

Introduce these concepts incrementally: evolution not revolution

Great resources

Exposure and confidence

And the secret of becoming a scientific programmer

TODO: quote from Woo et al.